Sound system improving speech intelligibility

ABSTRACT

The invention relates to a method and a device for improving speech intelligibility for a listener receiving a speech signal output through a transducer in a noisy environment, wherein, prior to the output, one or more parameters of the speech signal are modified in a signal processor in a way corresponding to what a speaking person would normally do when speaking in a noisy environment or when speaking clearly.

AREA OF THE INVENTION

The invention relates to sound delivery systems, where a sound source is delivering a sound signal to a listener. More specifically the invention relates to a method for improving the intelligibility of the output signal in such sound delivery systems as well as a sound delivery system implementing the method.

BACKGROUND OF THE INVENTION

In many situations a speech signal is output to a listener who is in a noisy environment, while the speech signal originates in a silent, or at least less noisy, environment than the location of the listener.

Examples of such situations include telephone communication, where one telephone device is located in a noisy environment and another is in a quiet environment, as well as ATM dispensing and similar situations, where a voice instruction is given automatically or upon request and where the environment may be noisy.

The objective of the present invention is to provide a remedy for noisy listening situations in which a listener may have difficulty understanding a voice message spoken or recorded in quiet conditions.

Vocal effort signifies the way normal speakers adapt their speech to changes in background noise, acoustic environment or communication distance. Specifically, vocal effort provoked by changing background noise is often referred to as the Lombard reflex, Lombard effect or Lombard speech, after the French ENT doctor E. Lombard (Lombard, 1911; see also Sullivan, 1963).

Similarly, ‘clear speech’ signifies the way normal speakers may adapt their speech when they want to improve speech intelligibility in various acoustical backgrounds (Krause & Braida, 2002).

Speech spoken with different vocal efforts can perceptually be classified as soft, normal, raised, loud or shouted. However, other classification labels can also be found in the scientific literature.

Variation in vocal effort is physiologically associated with changes in the airflow through the glottis, in the movements of the vocal cords, in the muscles of the pharynx, and in the shape of the vocal tract (Holmberg et al., 1988 & 1995; Ladefoged, 1967; Schulman, 1989; Södersten et al., 1995).

Perceptual experiments have demonstrated that speech produced with increased vocal effort is more intelligible than normal speech (Summers et al., 1988). It thus appears that speakers attempt to maintain an almost constant level of speech intelligibility when the information becomes degraded by environmental noise.

The most salient feature of vocal effort is probably the change in the overall amplitude and spectral characteristics of the speech signal. Pearsons et al. (1978) first described this in detail for face-to-face communication in background noise, and these results have later been included in the Speech Intelligibility Index standard (ANSI, 1997). Pearsons et al. found that the overall speech level increases systematically by about 0.6 dB/dB as a function of background noise level. However, a more significant effect was found at higher frequencies (a spectral tilt), resulting in an increase of about 0.8 dB/dB in the 1-3 kHz region. Others have made similar qualitative findings (Childers & Lee, 1991; Granström & Nord, 1992; Gauffin & Sundberg, 1989; Liénard & Di Benedetto, 1999). Since most background noises are dominated by low-frequency energy, the speech changes associated with vocal effort serve to maintain the audibility of the high-frequency speech elements even at adverse signal-to-noise ratios. Normally, speech information is highly redundant, so if audibility of the high-frequency speech elements is maintained when communicating in background noise, adequate speech intelligibility will be ensured for people with normal hearing.

Besides the overall amplitude and spectral changes described above, a series of other acoustic-phonetic features are also influenced by vocal effort. The following changes with increased vocal effort have been reported in the literature: decrease in rate of speaking (Hanley & Steer, 1949); increase of the pitch frequency, F₀, and of the first formant frequency, F₁ (Bond et al., 1989; Draegert, 1951; Junqua, 1993; Liénard & Di Benedetto, 1999; Loren et al., 1986; Rastatter & Rivers, 1983; Summers et al., 1988); increase in vowel duration and decrease in consonant duration (Bonnot & Chevrie-Muller, 1991; Fónagy & Fónagy, 1966; Rostolland, 1982; Traunmüller & Eriksson, 2000); and decrease in consonant/vowel energy ratio (Fairbanks & Miron, 1957; Junqua, 1993).

Both acoustical and perceptual analyses suggest that the Lombard effect works differently in male and female speakers. This gender effect has been studied systematically by Junqua (1993).

In Summary

The following acoustic-phonetic speech features appear to be affected by vocal effort:

-   level
-   frequency spectrum
-   rate of speaking
-   pitch, F₀
-   formant frequency, F₁
-   vowel and consonant duration
-   consonant/vowel energy ratio

and the observed changes are gender-specific.

SUMMARY OF THE INVENTION

According to the invention, the objective is achieved by means of the method as defined in claim 1.

By means of such modification of the output signal, the intelligibility will be improved for a listener in a noisy environment. Not all types of environmental noise affect speech communication to the same extent. For example, a very low-frequency noise signal will not affect the information in the speech signal (which is limited to frequencies above 100 Hz), although the sound level alone would indicate so. Therefore, not all noise types should activate a vocal effort processor as defined in claim 1 in the same way, and monitoring parameters other than the overall sound level would guide the vocal effort processor to an appropriate response to different noise types.

Preferably at least one of the following parameters of speech is modified: level, frequency spectrum, rate of speaking, pitch F₀, one or more formant frequencies F₁, F₂, . . . , vowel and consonant duration, consonant/vowel energy ratio.

According to the invention, the objective is also achieved by means of the sound delivery system as defined in claim 3.

By means of such modification of the output signal, the intelligibility will be improved for a listener in a noisy environment.

The invention will be described in more detail in the following description of embodiments, with reference to the drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic drawing showing an example of a sound delivery system where the invention may be implemented.

FIG. 2 is a schematic drawing showing a further example of a sound delivery system where the invention may be implemented.

DESCRIPTION OF A PREFERRED EMBODIMENT

The embodiment is characterised by the transmitter and the receiver of a communication channel being located in two environments with different environmental background noise conditions. Thus the conditions for producing speech in environment 1 and the conditions for listening to the speech in environment 2 will be different. If the speaker and listener were in the same environment, the speaker's voice would adapt to the level of the background noise (the vocal effort would be activated), and this would ensure that a normal-hearing listener could understand what the speaker is saying.

However, when the speaker and listener are not in the same environment, the background noise of environment 2 will not normally activate vocal effort in the speaker in environment 1. It is the idea of the present invention to artificially produce the missing vocal effort of the speaker in environment 1, so as to ease the understanding of the listener in environment 2.

In the embodiment shown in FIG. 1 the sound is either picked up directly from the speaker, synthesised from text or other input, or pre-recorded and stored for later use. On request or on-line, the speech is then sent to environment 2, where the intended listener is located. The speech can be sent in the communication channel as an analogue signal, a digital signal or as parameters of a speech or audio codec.

From the speech received by the receiver, a number of parameters characterising the incoming speech signal are deduced by pre-processor 1. In a vocal effort processor, these parameters are compared with a similar set deduced from environment 2 by pre-processor 2, and the vocal effort processor then adds vocal effort to the incoming speech signal if necessary. The parameters deduced by pre-processors 1 and 2 could be level, frequency tilt and long-term spectrum, Voice Activity Detection (VAD) and Speech to Noise Ratio (SpNR).

Given the SpNR of the incoming signal (environment 1) and the SpNR of environment 2, it is possible to correct the incoming signal for the degree of lack of vocal effort, so that the listener in environment 2 hears it more easily.
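The correction described above can be illustrated with a small sketch. It is not the method of the invention as such: it assumes that the missing vocal effort can be approximated by scaling the drop in SpNR between the two environments with a Lombard slope, using the value of about 0.6 dB/dB reported by Pearsons et al. for the overall level as the default. The function name and its parameters are hypothetical.

```python
def missing_vocal_effort_db(spnr_env1_db, spnr_env2_db, slope_db_per_db=0.6):
    """Estimate the level increase (in dB) that a talker would have added
    had they been exposed to the noise of environment 2.

    Assumption: the missing effort is the SpNR drop from environment 1 to
    environment 2, scaled by a Lombard slope in dB per dB of noise.
    """
    spnr_drop_db = spnr_env1_db - spnr_env2_db
    if spnr_drop_db <= 0.0:
        # The listener's environment is no worse than the talker's,
        # so no extra vocal effort is added.
        return 0.0
    return slope_db_per_db * spnr_drop_db


# Overall level correction, and a steeper correction for the 1-3 kHz region
# (0.8 dB/dB), for an SpNR that falls from 20 dB to 5 dB.
overall_db = missing_vocal_effort_db(20.0, 5.0)
high_band_db = missing_vocal_effort_db(20.0, 5.0, slope_db_per_db=0.8)
```

Using a separate slope per frequency region is one simple way of representing the spectral tilt that accompanies increased vocal effort.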

The addition of vocal effort to the incoming signal can be done in several ways. A first order approach is to correct only for level and frequency spectrum. As a second order approach, the duration and height of vowels and consonants can also be addressed. The addition of vocal effort can be done either directly in the vocal effort processor or in the receiver, as indicated by parameters sent from the vocal effort processor to the receiver.

For applications involving the first order approach, the addition of the vocal effort could typically be performed in the vocal effort processor itself. Applications involving the second order approach typically involve the use of a speech or audio codec, so it would be more straightforward to let the vocal effort processor modify the parameters of the incoming speech so that the receiver itself resynthesizes the speech with the vocal effort. This latter implementation approach makes the invention more computationally efficient if implemented in digital technology, and thus also more power efficient.

In a second preferred embodiment, shown in FIG. 2, pre-recorded speech or parameters of speech, for instance for speech synthesis, are stored in a storage means in a device, for instance a bank terminal, tourist information terminal or other device placed in an environment in which ambient noise levels are often problematic. The speech or parameters of speech stored in the storage means do not contain vocal effort. So if vocal effort is needed for proper communication in the environment, for instance due to a high level of ambient noise, it becomes difficult for the user of the device to understand the message from the device. It is the idea of the invention to artificially produce the missing vocal effort of the speech from the device, so as to ease the understanding of the user.

From the signal received from the microphone, a number of parameters characterising the incoming signal are deduced by a pre-processor, as described in connection with the first example embodiment. These parameters are compared to predefined values or a set of rules indicating when vocal effort is necessary. The vocal effort processor then adds vocal effort to the speech signal whenever it is necessary.

The speech can be sent to the transmitter either as an analogue signal, a digital signal or as parameters of a speech or audio codec. In the first two cases, the transmitter becomes a simple analogue or digital amplifier, and in the last case the speech parameters are first used to synthesise a speech signal before it is amplified and sent to the vocal effort processor.

In an alternative embodiment, instead of adding the vocal effort after the speech is recorded or synthesised, it would also be possible to store different versions of the speech or of the parameters for speech synthesis, which include different levels of vocal effort. The version matching the ambient noise level could then be selected, so that the user listens to a signal with the proper amount of vocal effort.

In another embodiment, the device uses online speech recognition to recognise the input from the user. The message from the device is then the response to what the user just said. In that connection, the device could use the information regarding the ambient noise level and other parameters of the environment to decide how to recognise the speech. It is well known from the literature that some features extracted from speech are more noise robust than others. So when little or no noise is present, it is not necessary to perform speech recognition with a large feature set, and only a subset of the feature set is used. However, as the ambient noise increases in level or becomes more disturbing for the speech recogniser, a larger feature set, including more noise-robust features of speech, is used.
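The noise-dependent choice of feature set can be sketched as follows. The text does not specify which features or which noise levels are used, so the tier boundaries and the feature names below are purely hypothetical placeholders chosen for illustration.

```python
# Hypothetical tiers: (minimum ambient noise level in dB, features added at
# that level). Neither the levels nor the feature names come from the text.
FEATURE_TIERS = [
    (float("-inf"), ["basic_cepstral"]),
    (60.0, ["delta_cepstral"]),
    (75.0, ["noise_robust_spectral"]),
]


def active_feature_set(ambient_noise_db):
    """Return the features the recogniser should use at this noise level.

    A small subset is used in quiet conditions; further, more noise-robust
    features are added as the ambient noise level increases.
    """
    features = []
    for min_level_db, names in FEATURE_TIERS:
        if ambient_noise_db >= min_level_db:
            features.extend(names)
    return features
```

In quiet conditions only the first tier is active, which keeps the recognition cost low; each higher tier extends the set rather than replacing it.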

The embodiment shown in FIG. 1 could be implemented in a mobile phone. This could be done in a number of ways, including modification of the parameters of the synthesis filter, modification of the function of the de-emphasis filter, or simply by adding a separate filter after the synthesis filter. The information necessary for estimating the speech to noise ratio, SpNR, in both environments, to be used for estimating a lack of vocal effort for one of the listeners, could be computed in the voice activity detection, VAD, part of the speech codec. In the VAD a substantial amount of the information needed to estimate the SpNR is already available, for instance in GSM phones today. By adding to this an estimate of the modulation in the observed signal, an estimate of the SpNR can be obtained. Since the addition of the vocal effort is only relevant when speech is present, the VAD output can be used to turn the vocal effort processing on and off, as is done for the speech codecs in GSM phones today.
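Switching the vocal effort processing on and off with the VAD output can be sketched as below. This is a minimal illustration that assumes frame-based processing with one boolean VAD decision per frame and a plain broadband gain standing in for the vocal effort processing; the function name and the frame representation are assumptions, not part of any codec interface.

```python
def apply_vocal_effort(frames, vad_flags, gain_db):
    """Apply a gain to speech frames only, leaving other frames untouched.

    frames    -- list of frames, each a list of float samples
    vad_flags -- one boolean per frame, True where speech is present
    gain_db   -- vocal effort gain in dB (a stand-in for the full processing)
    """
    linear_gain = 10.0 ** (gain_db / 20.0)
    processed = []
    for frame, is_speech in zip(frames, vad_flags):
        if is_speech:
            processed.append([sample * linear_gain for sample in frame])
        else:
            # No speech detected: pass the frame through unchanged.
            processed.append(list(frame))
    return processed
```

A practical implementation would probably smooth the gain across frame boundaries to avoid audible switching, which this sketch leaves out.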

The embodiment shown in FIG. 2 has been implemented on a stand-alone PC, equipped with a standard sound card and a database of pre-recorded utterances stored in the storage shown in the figure. In this case, the transmitter is a simple decoder, capable of reading the encoded digitized utterances from the storage. Once a selected utterance is converted in the transmitter to a series of digital voice samples, the vocal effort processor processes the digital speech samples by means of a digital FIR filter. The amount of amplification and the spectral shape of the FIR filter are controlled by the pre-processor. The pre-processor calculates an estimate of the L_eq of the digitized signal from the microphone in 6 octave bands with midband frequencies 0.25, 0.5, 1, 2, 4 and 8 kHz. The estimate of the L_eq is continuously updated. By means of the L_eq values, which are interpreted as a coarse estimate of the ambient noise spectrum, the amount of vocal effort to apply to the speech signal is determined by means of a look-up table. The look-up table defines standard speech spectrum levels for different vocal efforts, ranging from normal over raised and loud to shout. By calculating the difference between the ambient noise spectrum and the corresponding spectrum of speech at that ambient noise level, as defined by the look-up table, the gain and frequency spectrum of the FIR filter of the vocal effort processor are calculated. Finally the calculated filter characteristics are applied to the FIR filter of the vocal effort processor, which then changes the vocal effort of the pre-recorded voice utterances to match the ambient noise level.

The standard speech spectrum levels for different degrees of vocal effort are listed in the table below.

TABLE 1. Octave band speech spectrum: frequencies and standard speech spectra.

          Nominal         Standard speech spectrum level
Band      midband         for stated vocal effort, dB
No.       freq., Hz       Normal    Raised    Loud      Shout
1         250             34.75     38.98     41.55     42.50
2         500             34.27     40.15     44.85     49.24
3         1000            25.01     33.86     42.16     51.31
4         2000            17.32     25.32     34.39     44.32
5         4000             9.33     16.78     25.41     34.41
6         8000             1.13      5.07     11.39     20.72

Source: SII-procedure, ANSI S3.5 1997.
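The look-up step of the FIG. 2 implementation can be sketched with the values of Table 1. The sketch rests on several assumptions that the text does not state: the stored utterances are taken to be recorded at normal vocal effort, the vocal effort is chosen from the overall ambient level obtained by energy summation of the six octave band L_eq values, and the level boundaries in `EFFORT_UPPER_LIMITS_DB` are hypothetical. The resulting band gains would still have to be converted into FIR filter coefficients, which is not shown.

```python
import math

BAND_CENTRES_HZ = [250, 500, 1000, 2000, 4000, 8000]

# Standard speech spectrum levels in dB, copied from Table 1.
SPEECH_SPECTRUM_DB = {
    "normal": [34.75, 34.27, 25.01, 17.32, 9.33, 1.13],
    "raised": [38.98, 40.15, 33.86, 25.32, 16.78, 5.07],
    "loud": [41.55, 44.85, 42.16, 34.39, 25.41, 11.39],
    "shout": [42.50, 49.24, 51.31, 44.32, 34.41, 20.72],
}

# Hypothetical upper limits of overall ambient level (dB) for each effort;
# anything above the last limit is treated as "shout".
EFFORT_UPPER_LIMITS_DB = [("normal", 55.0), ("raised", 65.0), ("loud", 75.0)]


def overall_level_db(band_leq_db):
    """Combine octave band L_eq values into one level by energy summation."""
    return 10.0 * math.log10(sum(10.0 ** (level / 10.0) for level in band_leq_db))


def select_effort(band_leq_db):
    """Pick a vocal effort label from the ambient noise estimate."""
    level_db = overall_level_db(band_leq_db)
    for effort, upper_limit_db in EFFORT_UPPER_LIMITS_DB:
        if level_db < upper_limit_db:
            return effort
    return "shout"


def band_gains_db(target_effort, recorded_effort="normal"):
    """Per-band gain (dB) that moves the recorded spectrum to the target one."""
    target = SPEECH_SPECTRUM_DB[target_effort]
    recorded = SPEECH_SPECTRUM_DB[recorded_effort]
    return [round(t - r, 2) for t, r in zip(target, recorded)]


# Example: ambient noise of 60 dB in every octave band.
noise_leq_db = [60.0] * len(BAND_CENTRES_HZ)
effort = select_effort(noise_leq_db)
gains_db = band_gains_db(effort)
```

The gains grow most in the 1 to 4 kHz bands, which reflects the spectral tilt of raised and loud speech contained in the table.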

REFERENCE LIST

-   ANSI S3.5 (1997). ‘Methods for calculation of the speech intelligibility index’. American National Standard.
-   Bond, Z. S., Moore, T. J. and Gable, B. (1989). ‘Acoustic-phonetic characteristics of speech produced in noise and while wearing an oxygen mask’. J. Acoust. Soc. Am. 85, 907-12.
-   Bonnot, J-F. P. and Chevrie-Muller, C. (1991). ‘Some effects of shouted and whispered conditions on temporal organization of speech’. J. Phonetics 19, 473-83.
-   Childers, D. G. and Lee, C. K. (1991). ‘Vocal quality factors: Analysis, synthesis, and perception’. J. Acoust. Soc. Am. 90, 2394-2410.
-   Draegert, G. L. (1951). ‘Relationships between voice variables and speech intelligibility in high noise levels’. Speech Monogr. 18, 272-78.
-   Fairbanks, G. and Miron, M. (1957). ‘Effects of vocal effort upon the consonant-vowel ratio within the syllable’. J. Acoust. Soc. Am. 29, 621-6.
-   Fónagy, I. and Fónagy, J. (1966). ‘Sound pressure level and duration’. Phonetica 15, 14-21.
-   Gauffin, J. and Sundberg, J. (1989). ‘Spectral correlates of glottal voice source waveform characteristics’. J. Speech Hear. Res. 32, 556-65.
-   Granström, B. and Nord, L. (1992). ‘Neglected dimensions in speech synthesis’. Speech Commun. 11, 459-62.
-   Hanley, T. D. and Steer, M. D. (1949). ‘Effect of level of distracting noise upon speaking rate, duration and intensity’. J. Speech Hear. Disord. 14, 363-8.
-   Holmberg, E. B., Hillman, R. E. and Perkell, J. S. (1988). ‘Glottal airflow and transglottal air pressure measurements for male and female speakers in soft, normal and loud voice’. J. Acoust. Soc. Am. 84, 511-29.
-   Holmberg, E. B., Hillman, R. E., Perkell, J. S., Guiod, P. C. and Goldman, S. (1995). ‘Comparisons among aerodynamic, electroglottographic, and acoustic spectral measures for female voice’. J. Speech Hear. Res. 38, 1212-23.
-   Junqua, J. C. (1993). ‘The Lombard reflex and its role on human listeners and automatic speech recognizers’. J. Acoust. Soc. Am. 93, 510-24.
-   Krause, J. C. and Braida, L. D. (2002). ‘Investigating alternative forms of clear speech: The effects of speaking rate and speaking mode on intelligibility’. J. Acoust. Soc. Am. 112, 2165-2172.
-   Ladefoged, P. (1967). ‘Three Areas of Experimental Phonetics’. Oxford U. P., London.
-   Liénard, J-S. and Di Benedetto, M-G. (1999). ‘Effect of vocal effort on spectral properties of vowels’. J. Acoust. Soc. Am. 106, 411-22.
-   Lombard, E. (1911). ‘Le signe de l'élévation de la voix’. Ann. Maladies Oreille, Larynx, Nez, Pharynx 37, 101-19.
-   Loren, C. A., Colcord, R. D., and Rastatter, M. P. (1986). ‘Effects of auditory masking by white noise on variability of fundamental frequency during highly similar productions of spontaneous speech’. Percept. Mot. Skills 63, 1203-6.
-   Pearsons, K. S., Bennett, R. L. and Fidell, S. (1978). ‘Speech levels in various environments’. Bolt, Beranek and Newman Report 3281.
-   Rastatter, M. P. and Rivers, C. (1983). ‘The effects of short-term auditory masking on fundamental frequency variability’. J. Aud. Res. 23, 33-42.
-   Rostolland, D. (1982). ‘Acoustic features of shouted speech’. Acustica 50, 118-25.
-   Schulman, R. (1989). ‘Articulatory dynamics of loud and normal speech’. J. Acoust. Soc. Am. 85, 295-312.
-   Sullivan, R. F. (1963). ‘Report on Dr. Lombard's original research on the voice reflex test’. Acta. Otolaryngol. 56, 490-2.
-   Summers, W. Van, Pisoni, D. B., Bernacki, R. H., Pedlow, R. I., and Stokes, M. A. (1988). ‘Effect of noise on speech production: Acoustic and perceptual analyses’. J. Acoust. Soc. Am. 84, 3, 917-28.
-   Södersten, M., Hertegård, S. and Hammarberg, B. (1995). ‘Glottal closure, transglottal air-flow, and voice quality in healthy middle-aged women’. J. Voice 9, 182-97.
-   Traunmüller, H. and Eriksson, A. (2000). ‘Acoustic effects of variation in vocal effort by men, women, and children’. J. Acoust. Soc. Am. 107, 6, 3438-51.

1. A method of improving speech intelligibility for a listener receiving a speech signal output through a transducer in a noisy environment, where in the speech signal prior to the output one or more parameters have been modified in a signal processor corresponding to what a speaking person would normally do when speaking in a noisy environment or when speaking clearly.
2. A method according to claim 1, where at least one of the following parameters is modified: level, frequency spectrum, rate of speaking, pitch F₀, formant frequencies, F₁, F₂, . . . vowel and consonant duration, consonant/vowel energy ratio.
3. A device for improving speech intelligibility for a listener receiving a speech signal output through a transducer in a noisy environment, where in the speech signal prior to the output one or more parameters have been modified in a signal processor corresponding to what a speaking person would normally do when speaking in a noisy environment or when speaking clearly.
4. A device according to claim 3, where at least one of the following parameters is modified: level, frequency spectrum, rate of speaking, pitch F₀, formant frequencies, F₁, F₂, . . . vowel and consonant duration, consonant/vowel energy ratio.